Goto

Collaborating Authors

 federated online learning


A Large-Scale Web Search Dataset for Federated Online Learning to Rank

Gregoriadis, Marcel, Kang, Jingwei, Pouwelse, Johan

arXiv.org Artificial Intelligence

The centralized collection of search interaction logs for training ranking models raises significant privacy concerns. Federated Online Learning to Rank (FOLTR) offers a privacy-preserving alternative by enabling collaborative model training without sharing raw user data. However, benchmarks in FOLTR are largely based on random partitioning of classical learning-to-rank datasets, simulated user clicks, and the assumption of synchronous client participation. This oversimplifies real-world dynamics and undermines the realism of experimental results. We present AOL4FOLTR, a large-scale web search dataset with 2.6 million queries from 10,000 users. Our dataset addresses key limitations of existing benchmarks by including user identifiers, real click data, and query timestamps, enabling realistic user partitioning, behavior modeling, and asynchronous federated learning scenarios.


Federated Online Learning for Heterogeneous Multisource Streaming Data

Li, Jingmao, Chen, Yuanxing, Ma, Shuangge, Fang, Kuangnan

arXiv.org Machine Learning

Federated learning has emerged as an essential paradigm for distributed multi-source data analysis under privacy concerns. Most existing federated learning methods focus on the ``static" datasets. However, in many real-world applications, data arrive continuously over time, forming streaming datasets. This introduces additional challenges for data storage and algorithm design, particularly under high-dimensional settings. In this paper, we propose a federated online learning (FOL) method for distributed multi-source streaming data analysis. To account for heterogeneity, a personalized model is constructed for each data source, and a novel ``subgroup" assumption is employed to capture potential similarities, thereby enhancing model performance. We adopt the penalized renewable estimation method and the efficient proximal gradient descent for model training. The proposed method aligns with both federated and online learning frameworks: raw data are not exchanged among sources, ensuring data privacy, and only summary statistics of previous data batches are required for model updates, significantly reducing storage demands. Theoretically, we establish the consistency properties for model estimation, variable selection, and subgroup structure recovery, demonstrating optimal statistical efficiency. Simulations illustrate the effectiveness of the proposed method. Furthermore, when applied to the financial lending data and the web log data, the proposed method also exhibits advantageous prediction performance. Results of the analysis also provide some practical insights.


Unlearning for Federated Online Learning to Rank: A Reproducibility Study

Tao, Yiling, Wang, Shuyi, Yang, Jiaxi, Zuccon, Guido

arXiv.org Artificial Intelligence

This paper reports on findings from a comparative study on the effectiveness and efficiency of federated unlearning strategies within Federated Online Learning to Rank (FOLTR), with specific attention to systematically analysing the unlearning capabilities of methods in a verifiable manner. Federated approaches to ranking of search results have recently garnered attention to address users privacy concerns. In FOLTR, privacy is safeguarded by collaboratively training ranking models across decentralized data sources, preserving individual user data while optimizing search results based on implicit feedback, such as clicks. Recent legislation introduced across numerous countries is establishing the so called "the right to be forgotten", according to which services based on machine learning models like those in FOLTR should provide capabilities that allow users to remove their own data from those used to train models. This has sparked the development of unlearning methods, along with evaluation practices to measure whether unlearning of a user data successfully occurred. Current evaluation practices are however often controversial, necessitating the use of multiple metrics for a more comprehensive assessment -- but previous proposals of unlearning methods only used single evaluation metrics. This paper addresses this limitation: our study rigorously assesses the effectiveness of unlearning strategies in managing both under-unlearning and over-unlearning scenarios using adapted, and newly proposed evaluation metrics. Thanks to our detailed analysis, we uncover the strengths and limitations of five unlearning strategies, offering valuable insights into optimizing federated unlearning to balance data privacy and system performance within FOLTR. We publicly release our code and complete results at https://github.com/Iris1026/Unlearning-for-FOLTR.git.


How to Forget Clients in Federated Online Learning to Rank?

Wang, Shuyi, Liu, Bing, Zuccon, Guido

arXiv.org Artificial Intelligence

Data protection legislation like the European Union's General Data Protection Regulation (GDPR) establishes the \textit{right to be forgotten}: a user (client) can request contributions made using their data to be removed from learned models. In this paper, we study how to remove the contributions made by a client participating in a Federated Online Learning to Rank (FOLTR) system. In a FOLTR system, a ranker is learned by aggregating local updates to the global ranking model. Local updates are learned in an online manner at a client-level using queries and implicit interactions that have occurred within that specific client. By doing so, each client's local data is not shared with other clients or with a centralised search service, while at the same time clients can benefit from an effective global ranking model learned from contributions of each client in the federation. In this paper, we study an effective and efficient unlearning method that can remove a client's contribution without compromising the overall ranker effectiveness and without needing to retrain the global ranker from scratch. A key challenge is how to measure whether the model has unlearned the contributions from the client $c^*$ that has requested removal. For this, we instruct $c^*$ to perform a poisoning attack (add noise to this client updates) and then we measure whether the impact of the attack is lessened when the unlearning process has taken place. Through experiments on four datasets, we demonstrate the effectiveness and efficiency of the unlearning strategy under different combinations of parameter settings.